Creating a Real-Time, Reproducible Event Dataset

نویسنده

  • John Beieler
چکیده

The generation of political event data has remained much the same since the mid-1990s, both in terms of data acquisition and the process of coding text into data. Since the 1990s, however, there have been significant improvements in open-source natural language processing software and in the availability of digitized news content. This paper presents a new, next-generation event dataset, named Phoenix, that builds from these and other advances. This dataset includes improvements in the underlying news collection process and event coding software, along with the creation of a general processing pipeline necessary to produce daily-updated data. This paper provides a face validity checks by briefly examining the data for the conflict in Syria, and a comparison between Phoenix and the Integrated Crisis Early Warning System data. 1 Moving Event Data Forward Automated coding of political event data, or the record of who-did-what-towhom within the context of political actions, has existed for roughly two decades. The approach has remained largely the same during this time, with the underlying coding procedures not updating to reflect changes in natural language processing (NLP) technology. These NLP technologies have now advanced to such a level, and with accompanying open-source software implementations, that their inclusion in the event-data coding process comes as an obvious advancement. When combined with changes in how news content is obtained, the ability to store and process large amounts of text, and enhancements based on two decades worth of event-data experience, it becomes clear that political event data is ready for a next generation dataset. In this chapter, I provide the technical details for creating such a nextgeneration dataset. The technical details lead to a pipeline for the production of the Phoenix event dataset. The Phoenix dataset is a daily updated, nearreal-time political event dataset. The coding process makes use of open-source NLP software, an abundance of online news content, and other technical advances made possible by open-source software. This enables a dataset that is transparent and replicable, while providing a more accurate coding process than previously possible. Additionally, the dataset’s near-real-time nature also enables many applications that were previously impossible with batchupdated datasets, such as monitoring of ongoing events. Thus, this dataset

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Real-time quality monitoring in debutanizer column with regression tree and ANFIS

A debutanizer column is an integral part of any petroleum refinery. Online composition monitoring of debutanizer column outlet streams is highly desirable in order to maximize the production of liquefied petroleum gas. In this article, data-driven models for debutanizer column are developed for real-time composition monitoring. The dataset used has seven process variables as inputs and the outp...

متن کامل

CIFAR10-DVS: An Event-Stream Dataset for Object Classification

Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, which needs to be recorded using the neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular compute...

متن کامل

Toward Real Event Detection

News agencies and other news providers or consumers are confronted with the task of extracting events from news articles. This is done i) either to monitor and, hence, to be informed about events of specific kinds over time and/or ii) to react to events immediately. In the past, several promising approaches to extracting events from text have been proposed. Besides purely statistically-based ap...

متن کامل

Coupled Sparse Nmf vs. Random Forest Classification for Real Life Acoustic Event Detection

In this paper, we propose two methods for polyphonic Acoustic Event Detection (AED) in real life environments. The first method is based on Coupled Sparse Non-negative Matrix Factorization (CSNMF) of spectral representations and their corresponding class activity annotations. The second method is based on Multi-class Random Forest (MRF) classification of time-frequency patches. We compare the p...

متن کامل

Real-time Prediction and Synchronization of Business Process Instances using Data and Control Perspective

Nowadays, in a competitive and dynamic environment of businesses, organizations need to moni-tor, analyze and improve business processes with the use of Business Process Management Systems(BPMSs). Management, prediction and time control of events in BPMS is one of the major chal-lenges of this area of research that has attracted lots of researchers. In this paper, we present a...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1612.00866  شماره 

صفحات  -

تاریخ انتشار 2016